{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Isi Penting Generator news style"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate a long text with news style given isi penting (important facts)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/isi-penting-generator-news-style](https://github.com/huseinzol05/Malaya/tree/master/example/isi-penting-generator-news-style).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "Results generated using stochastic methods.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.77 s, sys: 3.95 s, total: 6.72 s\n",
      "Wall time: 1.98 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya\n",
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available HuggingFace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mesolitica/finetune-isi-penting-generator-t5-small-standard-bahasa-cased': {'Size (MB)': 242,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024},\n",
       " 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased': {'Size (MB)': 892,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024}}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.generator.isi_penting.available_huggingface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load HuggingFace\n",
    "\n",
    "Transformer Generator in Malaya is quite unique, most of the text generative model we found on the internet like GPT2 or Markov, simply just continue prefix input from user, but not for Transformer Generator. We want to generate an article or karangan like high school when the users give 'isi penting'.\n",
    "\n",
    "```python\n",
    "def huggingface(\n",
    "    model: str = 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased',\n",
    "    force_check: bool = True,\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load HuggingFace model to generate text based on isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')\n",
    "        Check available models at `malaya.generator.isi_penting.available_huggingface`.\n",
    "    force_check: bool, optional (default=True)\n",
    "        Force check model one of malaya model.\n",
    "        Set to False if you have your own huggingface model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.torch_model.huggingface.IsiPentingGenerator\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n",
      "You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
     ]
    }
   ],
   "source": [
    "model = malaya.generator.isi_penting.huggingface()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### generate\n",
    "\n",
    "```python\n",
    "def generate(\n",
    "    self,\n",
    "    strings: List[str],\n",
    "    mode: str = 'surat-khabar',\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    generate a long text given a isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings : List[str]\n",
    "    mode: str, optional (default='surat-khabar')\n",
    "        Mode supported. Allowed values:\n",
    "\n",
    "        * ``'surat-khabar'`` - news style writing.\n",
    "        * ``'tajuk-surat-khabar'`` - headline news style writing.\n",
    "        * ``'artikel'`` - article style writing.\n",
    "        * ``'penerangan-produk'`` - product description style writing.\n",
    "        * ``'karangan'`` - karangan sekolah style writing.\n",
    "\n",
    "    **kwargs: vector arguments pass to huggingface `generate` method.\n",
    "        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Good thing about HuggingFace\n",
    "\n",
    "In `generate` method, you can do greedy, beam, sampling, nucleus decoder and so much more, read it at https://huggingface.co/blog/how-to-generate\n",
    "\n",
    "And recently, huggingface released https://huggingface.co/blog/introducing-csearch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Dr M perlu dikekalkan sebagai perdana menteri',\n",
    "              'Muhyiddin perlulah menolong Dr M',\n",
    "              'rakyat perlu menolong Muhyiddin']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['TAN Sri Muhyiddin Yassin perlu dikekalkan sebagai perdana menteri berikutan '\n",
      " 'pengunduran Datuk Seri Dr Wan Azizah Wan Ismail daripada jawatan timbalan '\n",
      " 'perdana menteri pada 24 Februari lalu. Dalam satu kenyataan hari ini, '\n",
      " 'Setiausaha Agung DAP, Lim Guan Eng berkata ini kerana Muhyiddin merupakan '\n",
      " 'salah seorang pemimpin kanan Pakatan Harapan (PH) yang sedang berdepan '\n",
      " 'cabaran. \"Sebagai \\'pemimpin kiri\\', Dr Wan Azizah adalah pilihan pertama '\n",
      " 'dalam sejarah negara untuk menjawat jawatan timbalan perdana menteri. \"Oleh '\n",
      " 'itu, Dr Wan Azizah wajib dikekalkan sebagai timbalan perdana menteri sebagai '\n",
      " 'tanggungjawab yang perlu dilaksanakan oleh Muhyiddin,\" katanya. Guan Eng '\n",
      " 'berkata, Dr Mahathir tidak sepatutnya meletak jawatan, sebaliknya beliau '\n",
      " 'perlu menggalas tanggungjawab tersebut bagi memastikan kesejahteraan dan '\n",
      " 'kemakmuran rakyat dapat dirasai. \"Dalam keadaan negara sedang dalam '\n",
      " 'pemulihan, keadaan sekarang adalah masa terbaik untuk Dr Mahathir, dan dalam '\n",
      " 'masa sama, meneruskan agendanya dan memberi tumpuan kepada usaha '\n",
      " 'membangunkan ekonomi negara. \"Oleh yang demikian, tidak ada apa yang harus '\n",
      " 'dibimbangkan bahawa Dr Mahathir akan mengkhianatinya untuk memastikan '\n",
      " 'kestabilan politik negara dan rakyat,\" katanya. Beliau juga mempersoalkan '\n",
      " 'keabsahan kepimpinan Dr Mahathir yang dianggapnya gagal. Menurutnya, Dr '\n",
      " 'Mahathir juga gagal mengawal kerajaan PN secara teratur. \"Dengan cara ini, '\n",
      " 'Dr Mahathir tidak akan boleh menjadi PM yang sah temperatures, sekali']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    top_k=50, \n",
    "    top_p=0.95,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Perdana Menteri maritimes, Muhyiddin Yassin harus mengekalkan jawatan '\n",
      " 'sebagai Perdana Menteri sehingga pilihan raya umum ke-15 (PRU15), kata bekas '\n",
      " 'timbalannya, Dr Wan Ahmad Fayhsal Wan Ahmad Kamal. Menurut beliau, ini '\n",
      " 'kerana rakyat perlu membantu Muhyiddin dalam usaha memulihkan ekonomi '\n",
      " 'negara. \"Kalau kita nak bantu (kestabilan negara dan ekonomi), kita kena '\n",
      " 'jagalah. \"Jadi, kita kena bantu (kestabilan negara dan ekonomi). Kalau nak '\n",
      " 'bantu (kestabilan), kita kena jagalah,\" katanya kepada media selepas program '\n",
      " \"'Jom Jumpa Kasih #JomJibur' di sini, hari ini. Beliau mengulas mengenai \"\n",
      " 'kenyataan Presiden PKR, Anwar Ibrahim yang mahu Dr Mahathir dikekalkan '\n",
      " 'sebagai perdana menteri sehingga PRU15. Anwar dilaporkan berkata, perkara '\n",
      " 'itu perlu dibincangkan dengan pemimpin parti dalam Pakatan Harapan (PH).']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Neelofa tetap dengan keputusan untuk berkahwin akhir tahun ini',\n",
    "              'Long Tiger sanggup membantu Neelofa',\n",
    "              'Tiba-tiba Long Tiger bergaduh dengan Husein Zolkepli']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also can give any isi penting even does not make any sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['PEMINAT drama bersiri, Gegar Vaganza 2 (GV2), Neelofa tetap dengan keputusan '\n",
      " 'untuk berkahwin akhir tahun ini meskipun pernah diburu oleh aktor terkenal, '\n",
      " 'Yusry Abdul Halim. Namun, apabila menerima khabar duka duka dari teman '\n",
      " 'rapatnya, aktres cantik dan juga pengacara popular, Datuk Husein Zolkepli, '\n",
      " 'Long Tiger sanggup membantu Neelofa. Tiba-tiba Long Tiger bergaduh dengan '\n",
      " 'Husein. Menerusi video yang dikongsikan di media sosial, Husein turut '\n",
      " 'kelihatan berbual mesra dengan Neelofa. BACA: \"Ikut perancangan, saya nak '\n",
      " 'jadi orang ketiga\" - Neelofa BACA: \"Saya dah lama bergaduh dengan wanita '\n",
      " 'cantik\" - Husein Zolkepli \\'panas\\'...\" - Neelofa \\'panas\\'...\" - Datuk '\n",
      " 'Husein Zolkepli \\'panas\\'...\" - Neelofa \\'panas\\'...\" - Datuk Husein '\n",
      " 'Zolkepli \\'panas\\'...\" - Datuk Husein Zolkepli \\'panas\\'...']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, do_sample=True,\n",
    "    max_length=256,\n",
    "    top_k=50, \n",
    "    top_p=0.9, ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['PELAKON dan pengacara, Neelofa tetap dengan keputusannya untuk berkahwin '\n",
      " 'akhir tahun ini. Menerusi entri di Instagram (IG), Neelofa menjelaskan dia '\n",
      " 'masih lagi dalam proses untuk mendapatkan jodoh. \"Semalam saya sudah kahwin '\n",
      " 'dengan seorang wanita bernama Husein Zolkepli,\" tulisnya ringkas. Tiba-tiba '\n",
      " 'Long Tiger bergaduh dengan Husein. Long Tiger yang sedang bergaduh dengan '\n",
      " 'Husein. #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya '\n",
      " '#salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya '\n",
      " '#salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya '\n",
      " '#salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya #salamsaya '\n",
      " '#salamsaya #salamsaya #salamsaya #salamsaya #salamsaya']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
